NII Technical Report

Authors

  • Nina Kummer
  • Christa Womser-Hacker
  • Noriko Kando
Abstract

Orthographic varieties are common in the Japanese language, and represent a serious problem for Japanese information retrieval (IR), as IR systems run the risk of missing documents that contain variant forms of the search term. We propose two different strategies for handling orthographic varieties: pronunciation or yomi-based indexing and “Fuzzy Querying”, comparing katakana terms based on edit distance. Both strategies were integrated into our multiple index and fusion system, and tested using two different test collections, newspaper articles (Mainichi Shimbun ’98) and scientific abstracts (NTCIR-1), to compare their performance across text genres. The fusion of the results obtained with a bi-gram-based, a word-based, and the additional yomi-based index was found to improve precision significantly for the NTCIR-1 collection, but only slightly for the Mainichi Shimbun ’98 collection. Adding Fuzzy Querying as a fourth system and merging the results led to a further, but not significant, improvement in precision.
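
The katakana comparison described in the abstract can be illustrated with a minimal sketch. It assumes plain Levenshtein distance with unit costs and a hypothetical `max_distance` threshold; the abstract does not state which distance variant or cutoff the authors used, and the names `edit_distance`, `is_katakana`, and `fuzzy_expand` are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_katakana(term: str) -> bool:
    """True if every character is in the Unicode Katakana block (U+30A0..U+30FF)."""
    return bool(term) and all("\u30a0" <= ch <= "\u30ff" for ch in term)


def fuzzy_expand(query_term: str, vocabulary: list[str], max_distance: int = 1) -> list[str]:
    """Return index terms that count as katakana variants of the query term.

    max_distance is an assumed threshold, not a value taken from the report.
    Non-katakana query terms are returned unchanged.
    """
    if not is_katakana(query_term):
        return [query_term]
    return [t for t in vocabulary
            if is_katakana(t) and edit_distance(query_term, t) <= max_distance]


vocabulary = ["コンピューター", "コンピュータ", "計算機", "カメラ"]
variants = fuzzy_expand("コンピュータ", vocabulary)
```

Here the two common spellings of "computer" differ only by the trailing long-vowel mark, so both are retrieved, while the kanji term and the unrelated katakana word are not. A pair such as ヴァイオリン and バイオリン differs by two edits and would need a larger threshold.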

Similar resources

The Role of Intelligent Systems in the National Information Infrastructure

The National Information Infrastructure (NII) will have a profound effect on the education, lifestyle, and wellbeing of Americans from every corner of society. The infrastructure will transport critical information and software to every home, open educational and training opportunities to remote communities, and accelerate commerce by reducing the time to develop new products and increasing the...

Towards the ground truth: Exact algorithms for bioinformatics research (NII Shonan Meeting 2014-2)

This report is joint work of the attendees of the seminar (see Section 4) and has been edited by the organizers.

The Temperature of Extended Gas in Active Galaxies – Evidence for Matter-Bounded Clouds

We report measurements of the electron temperature at about a dozen locations in the extended emission-line regions of five active (Seyfert and radio) galaxies. Temperatures (T[OIII] and T[NII]) have been determined from both the I([OIII]λ4363)/I([OIII]λ5007) and I([NII]λ5755)/I([NII]λ6583) ratios. T[OIII] lies in the range (1.0 – 1.7) × 10^4 K. We find a strong trend for T[OIII] to be higher than...

Virtual Infrastructure: Putting Information Infrastructure on the Technology Curve

The present debate concerning the National Information Infrastructure (NII)[1] has focused primarily on the introduction of competitive markets for the supply and distribution of information. Although competition will be an important component of the NII, and one which we welcome, we argue that it is inappropriate to frame the debate entirely in terms of competition. Competition can be seen as ...

Japanese Effort Toward Sharing Text and Speech Corpora

This report introduces the activities of the two organizations related to collection and distribution of text and speech corpora in Japan. One is the Language Resource Association (GSK) and the other is NII-Speech Resources Consortium (NII-SRC).

Publication date: 2005